In [3]:

    
import pandas as pd
import numpy as np

Load Amazon dataset

1.

Load the dataset consisting of baby product reviews on Amazon.com. Store the data in a data frame products.



In [4]:

    
products = pd.read_csv('amazon_baby.csv')

Perform text cleaning

2.

We start by removing punctuation, so that words "cake." and "cake!" are counted as the same word.



In [16]:

    
products = products.fillna({'review':''})  # fill in N/A's in the review column



In [17]:

    
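import string

def remove_punctuation(text):
    # Drop every ASCII punctuation character. This form runs on Python 2 and 3;
    # text.translate(None, string.punctuation) exists only in Python 2.
    return ''.join(ch for ch in text if ch not in string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)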
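
As a quick sanity check of the cleaning step, here is a minimal sketch that assumes Python 3 and uses a hypothetical helper name, `strip_punctuation` (not part of the notebook). It relies on the `str.translate` deletion-table idiom and shows that "cake." and "cake!" collapse to the same token:

```python
import string

def strip_punctuation(text):
    """Return `text` with every ASCII punctuation character removed."""
    # A translation table whose third argument lists characters to delete
    # is the usual Python 3 idiom for this.
    return text.translate(str.maketrans('', '', string.punctuation))

samples = ["cake.", "cake!", "non-stop", "it's OK, really"]
cleaned = [strip_punctuation(s) for s in samples]
print(cleaned)  # ['cake', 'cake', 'nonstop', 'its OK really']
```

Note that hyphens and apostrophes are removed as well, which is why "non-stop" becomes "nonstop" in the review_clean column.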



In [18]:

    
products.head(5)









    Out[18]:

	name	review	rating	review_clean
0	Planetwise Flannel Wipes	These flannel wipes are OK, but in my opinion ...	3	These flannel wipes are OK but in my opinion n...
1	Planetwise Wipe Pouch	it came early and was not disappointed. i love...	5	it came early and was not disappointed i love ...
2	Annas Dream Full Quilt with 2 Shams	Very soft and comfortable and warmer than it l...	5	Very soft and comfortable and warmer than it l...
3	Stop Pacifier Sucking without tears with Thumb...	This is a product well worth the purchase.  I ...	5	This is a product well worth the purchase  I h...
4	Stop Pacifier Sucking without tears with Thumb...	All of my kids have cried non-stop when I trie...	5	All of my kids have cried nonstop when I tried...

Extract Sentiments

3.

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.



In [19]:

    
products = products[products['rating'] != 3]

4.

Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with a rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way to do this is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the rating column. In SFrame, as in pandas, you would use apply():



In [21]:

    
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
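
To see the labelling rule in isolation, here is a small sketch on a made-up DataFrame (the name `toy` and its ratings are illustrative assumptions, not data from the assignment). It drops the neutral rating and maps the rest to +1 or -1:

```python
import pandas as pd

# Hypothetical ratings, one of each star value
toy = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})

# Drop neutral reviews; .copy() makes the filtered frame independent,
# so the column assignment below does not write into a view of the original
toy = toy[toy['rating'] != 3].copy()

# Ratings above 3 become +1, everything else (here 1 and 2) becomes -1
toy['sentiment'] = toy['rating'].apply(lambda rating: +1 if rating > 3 else -1)
print(toy)
```

Taking an explicit copy after filtering is a design choice: depending on the pandas version, assigning a new column to a filtered frame may otherwise raise a SettingWithCopyWarning.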



In [37]:

    
products.head(3)









    Out[37]:

	name	review	rating	review_clean	sentiment
1	Planetwise Wipe Pouch	it came early and was not disappointed. i love...	5	it came early and was not disappointed i love ...	1
2	Annas Dream Full Quilt with 2 Shams	Very soft and comfortable and warmer than it l...	5	Very soft and comfortable and warmer than it l...	1
3	Stop Pacifier Sucking without tears with Thumb...	This is a product well worth the purchase.  I ...	5	This is a product well worth the purchase  I h...	1

Split into training and test sets

5.

Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. If you are using SFrame, make sure to use seed=1 so that you get the same result as everyone else does. (This way, you will get the right numbers for the quiz.)



In [43]:

    
import json
with open('test_data_idx.json') as test_data_file:    
    test_data_idx = json.load(test_data_file)
with open('train_data_idx.json') as train_data_file:    
    train_data_idx = json.load(train_data_file)

print(train_data_idx[:3])
print(test_data_idx[:3])









    



[0, 1, 2]
[8, 9, 14]
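
If the index files are not available, a seeded random split gives a reproducible 80/20 partition. The sketch below is an assumption-laden stand-in: `split_indices` is a hypothetical helper, and its output will not match the indices in train_data_idx.json and test_data_idx.json, so the quiz numbers would differ.

```python
import numpy as np

def split_indices(n_rows, train_fraction=0.8, seed=1):
    """Return (train_idx, test_idx): disjoint, sorted row positions."""
    rng = np.random.RandomState(seed)   # fixed seed for reproducibility
    shuffled = rng.permutation(n_rows)  # random ordering of 0..n_rows-1
    n_train = int(round(train_fraction * n_rows))
    return np.sort(shuffled[:n_train]), np.sort(shuffled[n_train:])

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

The returned positions can be passed to `DataFrame.iloc`, in the same way the notebook uses the lists loaded from JSON.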



In [46]:

    
train_data = products.iloc[train_data_idx]
train_data.head(2)









    Out[46]:

	name	review	rating	review_clean	sentiment
1	Planetwise Wipe Pouch	it came early and was not disappointed. i love...	5	it came early and was not disappointed i love ...	1
2	Annas Dream Full Quilt with 2 Shams	Very soft and comfortable and warmer than it l...	5	Very soft and comfortable and warmer than it l...	1



In [45]:

    
test_data = products.iloc[test_data_idx]
test_data.head(2)









    Out[45]:

	name	review	rating	review_clean	sentiment
9	Baby Tracker® - Daily Childcare Journal, S...	This has been an easy way for my nanny to reco...	4	This has been an easy way for my nanny to reco...	1
10	Baby Tracker® - Daily Childcare Journal, S...	I love this journal and our nanny uses it ever...	4	I love this journal and our nanny uses it ever...	1

Build the word count vector for each review

6.

We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as bag-of-word features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
Compute the occurrences of the words in each review and collect them into a row vector.
Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.
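
The steps above can be sketched on a toy corpus. The reviews below are made up for illustration, and the `toy_` names are assumptions that do not appear in the assignment. The point to notice is that words seen only in the test reviews ("but", "terrible") get no column:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_reviews = ["great product love it", "broke after a week"]
test_reviews = ["love it but it broke", "terrible"]

# Same token pattern as the notebook, so single-letter words such as "a" are kept
toy_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

# Learn the vocabulary from the training reviews and build their count matrix
toy_train_matrix = toy_vectorizer.fit_transform(train_reviews)

# Reuse the learned word-to-column mapping for the test reviews
toy_test_matrix = toy_vectorizer.transform(test_reviews)

print(sorted(toy_vectorizer.vocabulary_))  # columns are in alphabetical order
print(toy_train_matrix.shape, toy_test_matrix.shape)
print(toy_test_matrix.toarray())
```

Both matrices have the same number of columns (one per training word), and the second test review becomes an all-zero row because none of its words were in the training vocabulary.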



In [47]:

    
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn the vocabulary from the training data and assign columns to words,
# then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
#print vectorizer.vocabulary_

Train a sentiment classifier with logistic regression

7.

Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.



In [48]:

    
from sklearn.linear_model import LogisticRegression
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])









    Out[48]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

8.

There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j correspond to weights that cause positive sentiment, while negative weights correspond to negative sentiment. Calculate the number of positive (>= 0, which is actually nonnegative) coefficients.

Quiz question:

How many weights are >= 0?



In [53]:

    
np.sum(sentiment_model.coef_ >= 0)









    Out[53]:





85928

Answer

85928

Making predictions with logistic regression

9.

Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from test_data and prints their content:



In [60]:

    
sample_test_data = test_data.iloc[10:13]
print(sample_test_data)









    



                                                 name  \
59                          Our Baby Girl Memory Book   
71  Wall Decor Removable Decal Sticker - Colorful ...   
91  New Style Trailing Cherry Blossom Tree Decal R...   

                                               review  rating  \
59  Absolutely love it and all of the Scripture in...       5   
71  Would not purchase again or recommend. The dec...       2   
91  Was so excited to get this product for my baby...       1   

                                         review_clean  sentiment  
59  Absolutely love it and all of the Scripture in...          1  
71  Would not purchase again or recommend The deca...         -1  
91  Was so excited to get this product for my baby...         -1



In [61]:

    
sample_test_data.iloc[0]['review']









    Out[61]:





'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'



In [62]:

    
sample_test_data.iloc[1]['review']









    Out[62]:





'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'



In [66]:

    
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)

print(sentiment_model.predict(sample_test_matrix))









    



[  5.60988975  -3.13630894 -10.4141069 ]
[ 1 -1 -1]

Predicting Sentiment

11.

These scores can be used to make class predictions as follows:

Using scores, write code to calculate predicted labels for sample_test_data.

Checkpoint: Make sure your class predictions match with the ones obtained from sentiment_model. The logistic regression classifier in scikit-learn comes with the predict function for this purpose.
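
One way to turn scores into labels is to threshold them at zero: predict +1 when the score is positive and -1 otherwise. The sketch below hard-codes the three scores printed above (an assumption made for self-containment; in the notebook they come from `decision_function`), and `predicted_labels` is an illustrative name:

```python
import numpy as np

# Scores copied from the decision_function output above
scores = np.array([5.60988975, -3.13630894, -10.4141069])

# Positive score -> +1, zero or negative score -> -1
predicted_labels = np.where(scores > 0, 1, -1)
print(predicted_labels)  # [ 1 -1 -1]
```

This agrees with the labels returned by `sentiment_model.predict` for the same three reviews.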

Probability Predictions



In [75]:

    
print([1./(1+np.exp(-x)) for x in scores])









    



[0.99635188446958933, 0.041634146264237344, 3.0005288630612952e-05]



In [76]:

    
print(sentiment_model.classes_)
print(sentiment_model.predict_proba(sample_test_matrix))









    



[-1  1]
[[  3.64811553e-03   9.96351884e-01]
 [  9.58365854e-01   4.16341463e-02]
 [  9.99969995e-01   3.00052886e-05]]
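
The two outputs above agree because, for this binary model, the probability of the positive class is the sigmoid of the score. The sketch below rebuilds both probability columns from the three printed scores. The scores are hard-coded here as an assumption, and `sigmoid`, `prob_positive` and `prob_both` are illustrative names:

```python
import numpy as np

def sigmoid(score):
    """Logistic link: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

# Scores copied from the decision_function output earlier in this section
scores = np.array([5.60988975, -3.13630894, -10.4141069])

prob_positive = sigmoid(scores)

# Column order follows classes_ = [-1, 1]: P(y = -1) first, then P(y = +1)
prob_both = np.column_stack([1.0 - prob_positive, prob_positive])
print(prob_both)
```

The second column should match the list printed by the manual computation, and each row sums to 1.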

Quiz question:

Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

Answer:

third

Find the most positive (and negative) review

13.

We now turn to examining the full test dataset, test_data, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points.

Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."

To calculate these top-20 reviews, use the following steps:

Make probability predictions on test_data using the sentiment_model.
Sort the data according to those predictions and pick the top 20.
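
The two steps above can be sketched with NumPy on made-up probabilities. `top_k_indices` and `toy_prob` are hypothetical names introduced only for this illustration:

```python
import numpy as np

def top_k_indices(values, k, largest=True):
    """Return the positions of the k largest (or smallest) entries of `values`."""
    values = np.asarray(values)
    # argsort is ascending, so negate the values to rank the largest first;
    # a stable sort keeps tied entries in their original order
    order = np.argsort(-values if largest else values, kind='stable')
    return order[:k]

toy_prob = np.array([0.10, 0.95, 0.50, 0.99, 0.02])
print(top_k_indices(toy_prob, 2))                 # [3 1], most positive
print(top_k_indices(toy_prob, 2, largest=False))  # [4 0], most negative
```

On the real data one would pass the positive-class column of `predict_proba` and use the result with `test_data.iloc`. For very large scores the probabilities round to exactly 1.0 in floating point and therefore tie, so ranking by the raw `decision_function` scores, which order the reviews the same way, is a reasonable alternative.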

Quiz Question:

Which of the following products are represented in the 20 most positive reviews?



In [87]:

    
test_scores = sentiment_model.decision_function(test_matrix)
positive_idx = np.argsort(-test_scores)[:20]
print(positive_idx)
print(test_scores[positive_idx[0]])
test_data.iloc[positive_idx]









    



[18112 15732 24286 25554 24899  9125 21531 32782 30535  9555 14482 30634
 17558 26830 11923 20743  4140 30076 33060 26838]
53.8185477823






    Out[87]:

	name	review	rating	review_clean	sentiment
100166	Infantino Wrap and Tie Baby Carrier, Black Blu...	I bought this carrier when my daughter was abo...	5	I bought this carrier when my daughter was abo...	1
87017	Baby Einstein Around The World Discovery Center	I am so HAPPY I brought this item for my 7 mon...	5	I am so HAPPY I brought this item for my 7 mon...	1
133651	Britax 2012 B-Agile Stroller, Red	[I got this stroller for my daughter prior to ...	4	I got this stroller for my daughter prior to t...	1
140816	Diono RadianRXT Convertible Car Seat, Plum	I bought this seat for my tall (38in) and thin...	5	I bought this seat for my tall 38in and thin 2...	1
137034	Graco Pack 'n Play Element Playard - Flint	My husband and I assembled this Pack n' Play l...	4	My husband and I assembled this Pack n Play la...	1
50315	P'Kolino Silly Soft Seating in Tias, Green	I've purchased both the P'Kolino Little Reader...	4	Ive purchased both the PKolino Little Reader C...	1
119182	Roan Rocco Classic Pram Stroller 2-in-1 with B...	Great Pram Rocco!!!!!!I bought this pram from ...	5	Great Pram RoccoI bought this pram from Europe...	1
180646	Mamas & Papas 2014 Urbo2 Stroller - Black	After much research I purchased an Urbo2. It's...	4	After much research I purchased an Urbo2 Its e...	1
168081	Buttons Cloth Diaper Cover - One Size - 8 Colo...	We are big Best Bottoms fans here, but I wante...	4	We are big Best Bottoms fans here but I wanted...	1
52631	Evenflo X Sport Plus Convenience Stroller - Ch...	After seeing this in Parent's Magazine and rea...	5	After seeing this in Parents Magazine and read...	1
80155	Simple Wishes Hands-Free Breastpump Bra, Pink,...	I just tried this hands free breastpump bra, a...	5	I just tried this hands free breastpump bra an...	1
168697	Graco FastAction Fold Jogger Click Connect Str...	Graco's FastAction Jogging Stroller definitely...	5	Gracos FastAction Jogging Stroller definitely ...	1
97325	Freemie Hands-Free Concealable Breast Pump Col...	I absolutely love this product.  I work as a C...	5	I absolutely love this product  I work as a Cu...	1
147949	Baby Jogger City Mini GT Single Stroller, Shad...	Amazing, Love, Love, Love it !!! All 5 STARS a...	5	Amazing Love Love Love it  All 5 STARS all the...	1
66059	Evenflo 6 Pack Classic Glass Bottle, 4-Ounce	It's always fun to write a review on those pro...	5	Its always fun to write a review on those prod...	1
114796	Fisher-Price Cradle 'N Swing,  My Little Snuga...	My husband and I cannot state enough how much ...	5	My husband and I cannot state enough how much ...	1
22586	Britax Decathlon Convertible Car Seat, Tiffany	I researched a few different seats to put in o...	4	I researched a few different seats to put in o...	1
165593	Ikea 36 Pcs Kalas Kids Plastic BPA Free Flatwa...	For the price this set is unbelievable- and tr...	5	For the price this set is unbelievable and tru...	1
182089	Summer Infant Wide View Digital Color Video Mo...	I love this baby monitor.  I can compare this ...	5	I love this baby monitor  I can compare this o...	1
147996	Baby Jogger City Mini GT Double Stroller, Shad...	We are well pleased with this stroller, and I ...	4	We are well pleased with this stroller and I w...	1

14.

Now, let us repeat this exercise to find the "most negative reviews." Use the prediction probabilities to find the 20 reviews in the test_data with the lowest probability of being classified as a positive review. Repeat the same steps above but make sure you sort in the opposite order.

Quiz Question:

Which of the following products are represented in the 20 most negative reviews?



In [86]:

    
negative_idx = np.argsort(test_scores)[:20]
print(negative_idx)
print(test_scores[negative_idx[0]])
test_data.iloc[negative_idx]









    



[ 2931 21700 13939  8818 28184 17069  9655 14711 20594  1942  1810 10814
 13751 31226  7310 27231 28120   205 15062  5831]
-34.6348776854






    Out[86]:

	name	review	rating	review_clean	sentiment
16042	Fisher-Price Ocean Wonders Aquarium Bouncer	We have not had ANY luck with Fisher-Price pro...	2	We have not had ANY luck with FisherPrice prod...	-1
120209	Levana Safe N'See Digital Video Baby Monitor w...	This is the first review I have ever written o...	1	This is the first review I have ever written o...	-1
77072	Safety 1st Exchangeable Tip 3 in 1 Thermometer	I thought it sounded great to have different t...	1	I thought it sounded great to have different t...	-1
48694	Adiri BPA Free Natural Nurser Ultimate Bottle ...	I will try to write an objective review of the...	2	I will try to write an objective review of the...	-1
155287	VTech Communications Safe & Sounds Full Co...	This is my second video monitoring system, the...	1	This is my second video monitoring system the ...	-1
94560	The First Years True Choice P400 Premium Digit...	Note: we never installed batteries in these un...	1	Note we never installed batteries in these uni...	-1
53207	Safety 1st High-Def Digital Monitor	We bought this baby monitor to replace a diffe...	1	We bought this baby monitor to replace a diffe...	-1
81332	Cloth Diaper Sprayer--styles may vary	I bought this sprayer out of desperation durin...	1	I bought this sprayer out of desperation durin...	-1
113995	Motorola Digital Video Baby Monitor with Room ...	DO NOT BUY THIS BABY MONITOR!I purchased this ...	1	DO NOT BUY THIS BABY MONITORI purchased this m...	-1
10677	Philips AVENT Newborn Starter Set	It's 3am in the morning and needless to say, t...	1	Its 3am in the morning and needless to say thi...	-1
9915	Cosco Alpha Omega Elite Convertible Car Seat	I bought this car seat after both seeing  the ...	1	I bought this car seat after both seeing  the ...	-1
59546	Ellaroo Mei Tai Baby Carrier - Hershey	This is basically an overpriced piece of fabri...	1	This is basically an overpriced piece of fabri...	-1
75994	Peg-Perego Tatamia High Chair, White Latte	I can see why there are so many good reviews o...	2	I can see why there are so many good reviews o...	-1
172090	Belkin WeMo Wi-Fi Baby Monitor for Apple iPhon...	I read so many reviews saying the Belkin WiFi ...	2	I read so many reviews saying the Belkin WiFi ...	-1
40079	Chicco Cortina KeyFit 30 Travel System in Adve...	My wife and I have used this system in two car...	1	My wife and I have used this system in two car...	-1
149987	NUK Cook-n-Blend Baby Food Maker	It thought this would be great. I did a lot of...	1	It thought this would be great I did a lot of ...	-1
154878	VTech Communications Safe & Sound Digital ...	First, the distance on these are no more than ...	1	First the distance on these are no more than 7...	-1
1116	Safety 1st Deluxe 4-in-1 Bath Station	This item is junk.  I originally chose it beca...	1	This item is junk  I originally chose it becau...	-1
83234	Thirsties Hemp Inserts 2 Pack, Small 6-18 Lbs	My Experience: Babykicks Inserts failure vs RA...	5	My Experience Babykicks Inserts failure vs RAV...	1
31741	Regalo My Cot Portable Bed, Royal Blue	If I could give this product zero stars I woul...	1	If I could give this product zero stars I woul...	-1

Compute accuracy of the classifier

15.

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by

$$ \text{accuracy} = \frac{\text{\# correctly classified examples}}{\text{\# total examples}} $$

This can be computed as follows:

Step 1: Use the sentiment_model to compute class predictions.
Step 2: Count the number of data points when the predicted class labels match the ground truth labels.
Step 3: Divide the total number of correct predictions by the total number of data points in the dataset.
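
The three steps amount to comparing two label arrays element by element. Here is a minimal sketch on made-up labels. The function name `accuracy` and the `toy_` arrays are assumptions for illustration, and step 1 (obtaining predictions from the model) is replaced by a hard-coded list:

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    # Step 2: count matches; Step 3: divide by the number of data points
    return np.sum(predicted == actual) / float(len(actual))

toy_pred = [1, -1, 1, 1]    # stand-in for model predictions (Step 1)
toy_true = [1, -1, -1, 1]   # ground truth labels
print(accuracy(toy_pred, toy_true))  # 0.75
```

Dividing by `float(len(actual))` keeps the result a true fraction under Python 2 as well as Python 3.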

Quiz Question:

What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).

Quiz Question:

Does a higher accuracy value on the training_data always imply that the classifier is better?



In [88]:

    
predicted_y = sentiment_model.predict(test_matrix)
correct_num = np.sum(predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print("correct_num: {}, total_num: {}".format(correct_num, total_num))
accuracy = correct_num * 1./ total_num
print(accuracy)









    



correct_num: 31077, total_num: 33336
0.932235421166

Answer:

0.93

Learn another classifier with fewer words

16.

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of the words that occur in the reviews. For this assignment, we selected 20 words to work with.



In [91]:

    
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']



In [92]:

    
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
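
To make the effect of a fixed vocabulary concrete, here is a toy sketch with three words and two made-up reviews (all `toy_` names are illustrative assumptions). Only the listed words are counted, and the columns follow the order of the list:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_words = ['love', 'broke', 'great']

# With a fixed vocabulary there is nothing to learn, so transform() can be
# called directly; every word outside toy_words is ignored
toy_subset_vectorizer = CountVectorizer(vocabulary=toy_words)
toy_counts = toy_subset_vectorizer.transform(
    ["Love love LOVE this, great value", "it broke"])

print(toy_counts.toarray())
# [[3 0 1]
#  [0 1 0]]
```

The first row counts "love" three times because CountVectorizer lowercases text by default.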

Train a logistic regression model on a subset of data

17.

Now build a logistic regression classifier with train_matrix_word_subset as features and sentiment as the target. Call this model simple_model.



In [93]:

    
simple_model = LogisticRegression()
simple_model.fit(train_matrix_word_subset, train_data['sentiment'])









    Out[93]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

18.

Let us inspect the weights (coefficients) of the simple_model. First, build a table to store (word, coefficient) pairs. If you are using scikit-learn, you can combine words with coefficients in a pandas DataFrame by running:



In [102]:

    
simple_model_coef_table = pd.DataFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
#simple_model_coef_table
simple_model_coef_table.sort_values(['coefficient'], ascending=False)









    Out[102]:

	coefficient	word
6	1.673074	loves
5	1.509812	perfect
0	1.363690	love
2	1.192538	easy
1	0.944000	great
4	0.520186	little
7	0.503760	well
8	0.190909	able
3	0.085513	old
9	0.058855	car
11	-0.209563	less
16	-0.320556	product
18	-0.362167	would
12	-0.511380	even
15	-0.621169	work
17	-0.898031	money
10	-1.651576	broke
13	-2.033699	waste
19	-2.109331	return
14	-2.348298	disappointed

Quiz Question:

Consider the coefficients of simple_model. How many of the 20 coefficients (corresponding to the 20 significant_words) are positive for the simple_model?



In [103]:

    
len(simple_model_coef_table[simple_model_coef_table['coefficient']>0])









    Out[103]:





10

Answer:

10

Quiz Question:

Are the positive words in the simple_model also positive words in the sentiment_model?



In [ ]:

    
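# Look up each significant word's coefficient in sentiment_model through the full
# vocabulary (assumes every one of the 20 words occurs in the training reviews)
word_columns = [vectorizer.vocabulary_[word] for word in significant_words]
model_coef_table = pd.DataFrame({'word':significant_words,
                                 'coefficient':sentiment_model.coef_.flatten()[word_columns]})
model_coef_table.sort_values(['coefficient'], ascending=False)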



In [118]:

    
vectorizer_word_subset.get_feature_names()









    Out[118]:





['love',
 'great',
 'easy',
 'old',
 'little',
 'perfect',
 'loves',
 'well',
 'able',
 'car',
 'broke',
 'less',
 'even',
 'waste',
 'disappointed',
 'work',
 'product',
 'money',
 'would',
 'return']

Answer:

Comparing models

19.

We will now compare the accuracy of the sentiment_model and the simple_model.

First, compute the classification accuracy of the sentiment_model on the train_data.

Now, compute the classification accuracy of the simple_model on the train_data.



In [120]:

    
train_predicted_y = sentiment_model.predict(train_matrix)
correct_num = np.sum(train_predicted_y == train_data['sentiment'])
total_num = len(train_data['sentiment'])
print("correct_num: {}, total_num: {}".format(correct_num, total_num))
train_accuracy = correct_num * 1./ total_num
print("sentiment_model training accuracy: {}".format(train_accuracy))

train_predicted_y = simple_model.predict(train_matrix_word_subset)
correct_num = np.sum(train_predicted_y == train_data['sentiment'])
total_num = len(train_data['sentiment'])
print("correct_num: {}, total_num: {}".format(correct_num, total_num))
train_accuracy = correct_num * 1./ total_num
print("simple_model training accuracy: {}".format(train_accuracy))









    



correct_num: 129159, total_num: 133416
sentiment_model training accuracy: 0.968092282785
correct_num: 115648, total_num: 133416
simple_model training accuracy: 0.866822570007

Quiz Question:

Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

Answer:

sentiment_model

20.

Now, we will repeat this exercise on the test_data. Start by computing the classification accuracy of the sentiment_model on the test_data.

Next, compute the classification accuracy of the simple_model on the test_data.



In [122]:

    
test_predicted_y = sentiment_model.predict(test_matrix)
correct_num = np.sum(test_predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print("correct_num: {}, total_num: {}".format(correct_num, total_num))
test_accuracy = correct_num * 1./ total_num
print("sentiment_model test accuracy: {}".format(test_accuracy))

test_predicted_y = simple_model.predict(test_matrix_word_subset)
correct_num = np.sum(test_predicted_y == test_data['sentiment'])
total_num = len(test_data['sentiment'])
print("correct_num: {}, total_num: {}".format(correct_num, total_num))
test_accuracy = correct_num * 1./ total_num
print("simple_model test accuracy: {}".format(test_accuracy))









    



correct_num: 31077, total_num: 33336
sentiment_model test accuracy: 0.932235421166
correct_num: 28981, total_num: 33336
simple_model test accuracy: 0.869360451164

Quiz Question:

Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

Answer:

sentiment_model

Baseline: Majority class prediction

21.

It is quite common to use the majority class classifier as a baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, you should healthily beat the majority class classifier; otherwise, the model is (usually) pointless.
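
A majority class baseline needs nothing more than label counts. The sketch below uses made-up labels, and `majority_class_accuracy` is a hypothetical helper name. It returns the most frequent label and the accuracy obtained by always predicting it:

```python
from collections import Counter

def majority_class_accuracy(labels):
    """Return (majority_label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / float(len(labels))

toy_labels = [1, 1, 1, -1]
print(majority_class_accuracy(toy_labels))  # (1, 0.75)
```

Strictly speaking, the majority class would be chosen from the training labels and then scored on the test labels. When the same class dominates both sets, counting on the test labels alone gives the same accuracy.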



In [124]:

    
positive_label = len(test_data[test_data['sentiment']>0])
negative_label = len(test_data[test_data['sentiment']<0])
print("positive_label is {}, negative_label is {}".format(positive_label, negative_label))









    



positive_label is 28095, negative_label is 5241



In [125]:

    
baseline_accuracy = positive_label*1./(positive_label+negative_label)
print("baseline_accuracy is {}".format(baseline_accuracy))









    



baseline_accuracy is 0.842782577394

Quiz Question:

Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).

Answer:

0.84

Quiz Question:

Is the sentiment_model definitely better than the majority class classifier (the baseline)?

Answer:

Yes


	name	review	rating	review_clean	sentiment
16042	Fisher-Price Ocean Wonders Aquarium Bouncer	We have not had ANY luck with Fisher-Price pro...	2	We have not had ANY luck with FisherPrice prod...	-1
120209	Levana Safe N'See Digital Video Baby Monitor w...	This is the first review I have ever written o...	1	This is the first review I have ever written o...	-1
77072	Safety 1st Exchangeable Tip 3 in 1 Thermometer	I thought it sounded great to have different t...	1	I thought it sounded great to have different t...	-1
48694	Adiri BPA Free Natural Nurser Ultimate Bottle ...	I will try to write an objective review of the...	2	I will try to write an objective review of the...	-1
155287	VTech Communications Safe & Sounds Full Co...	This is my second video monitoring system, the...	1	This is my second video monitoring system the ...	-1
94560	The First Years True Choice P400 Premium Digit...	Note: we never installed batteries in these un...	1	Note we never installed batteries in these uni...	-1
53207	Safety 1st High-Def Digital Monitor	We bought this baby monitor to replace a diffe...	1	We bought this baby monitor to replace a diffe...	-1
81332	Cloth Diaper Sprayer--styles may vary	I bought this sprayer out of desperation durin...	1	I bought this sprayer out of desperation durin...	-1
113995	Motorola Digital Video Baby Monitor with Room ...	DO NOT BUY THIS BABY MONITOR!I purchased this ...	1	DO NOT BUY THIS BABY MONITORI purchased this m...	-1
10677	Philips AVENT Newborn Starter Set	It's 3am in the morning and needless to say, t...	1	Its 3am in the morning and needless to say thi...	-1
9915	Cosco Alpha Omega Elite Convertible Car Seat	I bought this car seat after both seeing the ...	1	I bought this car seat after both seeing the ...	-1
59546	Ellaroo Mei Tai Baby Carrier - Hershey	This is basically an overpriced piece of fabri...	1	This is basically an overpriced piece of fabri...	-1
75994	Peg-Perego Tatamia High Chair, White Latte	I can see why there are so many good reviews o...	2	I can see why there are so many good reviews o...	-1
172090	Belkin WeMo Wi-Fi Baby Monitor for Apple iPhon...	I read so many reviews saying the Belkin WiFi ...	2	I read so many reviews saying the Belkin WiFi ...	-1
40079	Chicco Cortina KeyFit 30 Travel System in Adve...	My wife and I have used this system in two car...	1	My wife and I have used this system in two car...	-1
149987	NUK Cook-n-Blend Baby Food Maker	It thought this would be great. I did a lot of...	1	It thought this would be great I did a lot of ...	-1
154878	VTech Communications Safe & Sound Digital ...	First, the distance on these are no more than ...	1	First the distance on these are no more than 7...	-1
1116	Safety 1st Deluxe 4-in-1 Bath Station	This item is junk. I originally chose it beca...	1	This item is junk I originally chose it becau...	-1
83234	Thirsties Hemp Inserts 2 Pack, Small 6-18 Lbs	My Experience: Babykicks Inserts failure vs RA...	5	My Experience Babykicks Inserts failure vs RAV...	1
31741	Regalo My Cot Portable Bed, Royal Blue	If I could give this product zero stars I woul...	1	If I could give this product zero stars I woul...	-1

	coefficient	word
6	1.673074	loves
5	1.509812	perfect
0	1.363690	love
2	1.192538	easy
1	0.944000	great
4	0.520186	little
7	0.503760	well
8	0.190909	able
3	0.085513	old
9	0.058855	car
11	-0.209563	less
16	-0.320556	product
18	-0.362167	would
12	-0.511380	even
15	-0.621169	work
17	-0.898031	money
10	-1.651576	broke
13	-2.033699	waste
19	-2.109331	return
14	-2.348298	disappointed
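
A table like the one above pairs each word with a coefficient and orders the rows by that value. As a minimal sketch, assuming the words and their coefficients are already available as two parallel lists (the names `words`, `coefficients` and `coef_table` are hypothetical, and only the rows with index 0 to 4 are copied from the table), pandas can build and sort such a table:

```python
import pandas as pd

# Rows with index 0-4 from the table above, listed in index order.
# The variable names are illustrative, not taken from the notebook.
words = ['love', 'great', 'easy', 'old', 'little']
coefficients = [1.363690, 0.944000, 1.192538, 0.085513, 0.520186]

coef_table = pd.DataFrame({'coefficient': coefficients, 'word': words})

# Sort from the largest coefficient to the smallest. The original index is
# kept, which is why the index column in the table above is out of order.
coef_table = coef_table.sort_values('coefficient', ascending=False)
print(coef_table)
```

If these are the coefficients of a linear sentiment classifier, words near the top of the sorted table push a review toward the +1 label and words near the bottom push it toward -1.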